Evaluating Model Robustness to Image Quality Degradation in Histological Classification

Image 14

Published

May 30, 2025

Code
library(readr)
library(dplyr)
library(ggplot2)
library(purrr)
library(stringr)
library(cowplot)

metrics <- read_csv("../metrics/combined_report_metrics.csv")

1 Executive Summary

Automated image classification using computer vision has become an increasingly valuable tool in medical pathology to detect and classify cells within tissue samples, with some models achieving comparable performance to expert clinicians. However, these models are trained on high-resolution images that are costly to produce. In practice, histopathology images often suffer from loss of quality due to scanner blur, staining variability, or noise, such as when using Whole Slide Imaging (WSI). This study investigates how such degradations affect classification performance to identify robust architectures that best support clinical decision making.

To simulate lower-quality image data, Gaussian blur and noise were applied to Xenium Analyser-derived images during testing. We evaluated four models: ResNet50 (pretrained on ImageNet), a custom convolutional neural network (CNN), Random Forest, and XGBoost. The deep learning models processed raw image data to learn hierarchical spatial features, while traditional machine learning models relied on dimensionality-reduced input, with principal component analysis (PCA) selected over histogram of gradient (HOG) features after initial experimentation, to capture broader structure. The classification task involved identifying four cell types found in Eosin (H&E)-stained cancerous breast tissue: Tumour, Immune, Stromal, and Other. While all models showed a decline in performance as image quality decreased, machine learning models maintained relatively high performance, with negligible accuracy loss under added noise and only around a 10% decrease under severe blur. Notably, these models consistently detected over 80% of tumour cells across degradation levels, suggesting potential as screening tools in lower-quality imaging environments.

In contrast, deep learning models, while achieving over 70% overall accuracy on high-quality images, were more sensitive to image degradation. Their performance declined sharply under blur and noise, often defaulting to one or two dominant classes. These results highlight the need to align choice of model with expected input quality.

To help clinicians choose the most suitable model for image quality issues they may encounter, a Shiny app was developed to visualise how image degradation affects performance, including per-class, confidence, and stability. Users can upload their own images to enable an informed decision.

The experiment was designed for full reproducibility. All code and implementation instructions are available at https://github.com/AlanS812/data3888-14. Figure 1 outlines the experimental workflow, with corresponding Python scripts indicated for clarity and ease of use.

Figure 1: Diagram illustrating the experimental workflow and deployment pipeline. Each step is annotated with the corresponding Python script used, enabling full reproducibility of the process.

2 Background

Computer Vision models are being increasingly used to differentiate cell types in histopathology slides. With advancements in the field, some models have achieved classification accuracies as high as 98.5%1, exceeding the performance of experienced human pathologists2. However, a key concern remains: can these models maintain their performance when applied to typical clinical image data, which often varies in quality due to limitations in digital scanning and inconsistencies in slide preparation?

Although the cause of image degradation can vary widely, such as motion blur typical of MRIs3, this study is concerned with Whole Slide Imaging (WSI). WSI is a common technique of scanning and digitising full microscope slides, as opposed to in sections4. Though it greatly reduces processing time, it often introduces unintentional image degradations such as blur and noise, especially when focal points are misaligned due to the 3D nature of the tissue4. If a selected focal point is not representative of the slide’s variation in depth, surrounding areas may appear out of focus, reducing image clarity and obscuring high-frequency structural information such as texture and edges5.

Although increasing the number of focal points may somewhat mitigate the issue, it also increases processing time, inconveniencing patients and adding delays on overloaded labs4. Despite imaging technology advancing quickly, the associated expense means image quality issues are likely to persist. Therefore, understanding how reductions in image quality impact a model’s classification ability is vital for medical imagers and other clinicians, particularly in the context of cancer diagnosis, such as the breast cancer context of the current study6.

3 Method

Data was sourced from the Gene Expression Omnibus, a public repository of biomedical data. A Xenium Analyser was used to produce high resolution whole-slide images of breast tissue with a tumour present6, an alternative to the lower quality images typically generated by WSI. Individual cells were identified, cropped to an image, and labelled7. Though cell images with a pixel border of both 50 and 100 were provided, only the 100 image set was used due to their higher historical performance, and to balance computational constraints given the high volume of raw data provided8. All images originated from the same selected slide.

Images were grouped into classes based on their role in breast cancer progression and diagnosis. Tumour cells were grouped for their direct pathological relevance, immune cells for their similar functional responses to disease, and stromal cells for their structural role in the tumour environment. The “other” group included cells that didn’t clearly fit the main categories but were retained to help the model learn to distinguish diagnostically important cells from potentially less relevant ones.9,10

Each class was randomly and equally sampled to ensure the model learns from all biologically important categories, not just those most prevalent in the provided tissue. In real tissue, critical cells like Tumour or Immune types may be rare, so underrepresenting them could potentially weaken the model’s diagnostic ability. Unlabelled cell images were excluded to ensure consistent truth labels and avoid introducing noise into the training process.

To balance model performance with computational constraints, 20,000 of the given ~175,000 images were used, with 42% for training, 8% for validation, and 50% for testing. Although a typical data split allocates 80% for training, 10% for validation, and 10% for testing, the test set size was increased due to the high stakes nature of tumour cell classification, and thus the need for certainty in how a model will perform. Three separate test sets were used to ensure results were reproduced consistently. Due to natural variation in cell size, cropped images were not uniform, and were therefore resized to 224×224 pixels using Lanczos downsampling (Duchon 1979)11 to meet ResNet50’s input requirements.

To compare the robustness of different modelling approaches to image degradation, we trained two deep learning models and two classical machine learning models. Models were also selected for their suitability to high-dimensional medical image data. CNNs and ImageNet-pretrained ResNet50 were included for their ability to extract spatial features directly from pixel data, while Random Forest and XGBoost were selected for their robustness to high-dimensional, potentially redundant features, particularly when paired with dimensionality reduction techniques like HOG and PCA. All four have demonstrated success in medical image classification in previous studies1215.

Deep learning models were trained on pixel inputs normalised using ImageNet parameters to ensure comparability of results. For the machine learning models, three input formats were tested: raw pixels, HOG, and PCA16 Raw pixel input proved too computationally expensive, and HOG underperformed, likely due to its loss of colour and brightness (intensity) details crucial in stained histology images.

Compared to HOG, PCA improved base model performance and produced a more compact feature space. Each 224x224 image (~50,000 pixels) was flattened and reduced to 100 principal components, capturing ~60% of dataset variance. PCA was fit on training data only, and test images were transformed using the same components to avoid data leakage. These components can be visualised as image-like patterns that reflect global structure, aiding in interpretability, a crucial consideration in clinical settings.

Models were evaluated on a shared test set across varying levels of Gaussian blur and noise. Initially, small increments of noise and blur were used, then adjusted to larger intervals once performance patterns emerged to reduce computation. Blur was applied using Gaussian kernels from size 0 to 19, and noise ranged from 0 to 30, covering the full range from high-quality to heavily degraded images. Overall performance was measured using accuracy, weighted F1, precision, recall, and average maximum confidence. Class-level metrics included per-class precision, recall, F1, and average confidence, providing insight into both prediction reliability and stability under degradation.

4 Results

4.1 Machine vs Deep Learning

Code
blur_plot_data <- metrics %>%
  select(test_set, blur_size, noise_level, accuracy, Model_Label) %>%
  filter(Model_Label %in% c("RF (PCA)", "XGBoost (PCA)", "CNN", "ResNet")) %>%
  group_by(blur_size, noise_level, Model_Label) %>%
  summarise(accuracy = mean(accuracy), .groups = "drop") %>%
  mutate(
    noise_level = as.factor(noise_level),
    Model_Label  = factor(Model_Label)
  )

print(ggplot(blur_plot_data, aes(x = blur_size, y = accuracy, color = noise_level, group = noise_level)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  facet_wrap(~ Model_Label, ncol = 2) +
  labs(
    x = "Blur Radius",
    y = "Accuracy",
    color = "Noise Level"
  ) +
    scale_color_brewer(palette = "Blues") +
  theme_minimal(base_size = 13) +
  theme(
    strip.text = element_text(face = "bold", size = 12),
    legend.position = "right"
  ))
Figure 2: Accuracy vs Blur Radius by Noise Level, faceted by Model.

As shown in Figure 2, deep learning models achieved higher accuracy on unaltered images, but their performance dropped sharply with any form of degradation. In contrast, both machine learning models maintained relatively stable accuracy, weighted F1, recall, and precision across degradation levels.

Interestingly, deep learning models preserved high average precision even under heavy blur, but only when no noise was applied. Once any noise was introduced, they predominantly predicted the Immune and Stromal classes. Because Other and Tumour were almost never predicted, the average precision was artificially inflated. This limitation is explored further in the ‘Class Breakdown’ section.

4.1.1 Deep Learning

Noise causes the sharpest drop in accuracy for the CNN and ResNet50 models, with ResNet50 dropping to 30% after noise reaches a standard deviation of 1. The CNN tolerates noise up to level 3 before dropping sharply. Both models are highly effective in extracting spatial features from data - relating the information of several neighbouring pixels rather than individual values. Due to these close pixel relationships in the CNNs, it makes sense that slight changes to each pixel’s colour - and subsequently all surrounding pixels - would have a significant impact on the model’s performance. Mayer et al. (2022)17 similarly found that denoising software improved CNN performance.

Blur also caused a rapid performance drop, especially beyond kernel sizes of 9 pixels—though less severe than noise. This result is not unsurprising as even to the human eye, this level of blurring appears to be significant. This is consistent with the performance seen above with Gaussian noise applied, however to a less extreme extent as neighbouring pixels are not being altered using the same normally random distribution, rather as a function of their neighbouring colour channels. As a result the impact of blurring is less intense at low levels than noise, with only minor performance decreases.

Similar trends have been reported in other studies, where models trained on sharp images struggled to generalise to blurred inputs18

4.1.2 Machine Learning

Figure 3: Original and augmented images reconstructed after PCA dimension reduction, generated in pca.py.

On the other hand, although starting with lower initial accuracy, the machine learning models demonstrated greater robustness to image degradation. Both Random Forest and XGBoost showed minimal performance drops with blur kernel sizes up to 5 and were largely unaffected by noise levels up to 30.

The robustness of these models is most likely due to the use of PCA. Upon visual inspection of Figure 3, it is clear that applying PCA transforms images into a lower-dimensional representation that appears visually blurred and less noisy. Because PCA captures global patterns across the entire image, it effectively filters out small-scale noise. Comparatively, CNN will examine local patterns which could potentially be distorted by localised noise. In fact, PCA is commonly used as a denoising technique [20], which helps explain its contribution to model stability under degradation.

The machine learning models are slightly less robust to blur. As blur increases, the linear transformation can detect less structural information. However, even at extreme levels where it is uninterpretable to the human eye, the model can still extract colour, spatial and intensity information. Again, due to its global nature it is able to capture larger scale features, so it can continue extracting relevant patterns even with limited local information.

Therefore, while ResNet50 has the best classification performance with no augmentations, once image quality drops a simpler model like XGBoost may be more reliable.

4.2 Class Breakdown

Code
# need to fix caption here 

# parse confusion matrices
parse_cm <- function(cm_str) {
  nums <- as.numeric(unlist(str_extract_all(cm_str, "\\d+")))
  matrix(nums, nrow = 4, byrow = TRUE)
}

# average across test sets
get_avg_cm <- function(df, model_name, blur_val, noise_val) {
  df %>%
    filter(Model_Label == model_name, blur_size == blur_val, noise_level == noise_val) %>%
    pull(confusion_matrix) %>%
    map(parse_cm) %>%
    reduce(`+`) %>%
    `/`(3)  # divide by 3 test sets
}

plot_cm <- function(cm, title) {
  df <- as.data.frame(as.table(cm))
  colnames(df) <- c("True", "Predicted", "Freq")
  df$True      <- factor(as.integer(df$True),
                         levels=1:4,
                         labels=c("Immune","Other","Stromal","Tumour"))
  df$Predicted <- factor(as.integer(df$Predicted),
                         levels=1:4,
                         labels=c("Immune","Other","Stromal","Tumour"))
  ggplot(df, aes(x=Predicted, y=True, fill=Freq)) +
    geom_tile(color="white") +
    geom_text(aes(label=round(Freq, 1)), size = 4) +
    scale_fill_gradient(low="white", high="dodgerblue") +
    coord_fixed() +
    labs(title=title) +
        theme_minimal(base_size = 12) +
    theme(
      axis.title       = element_blank(),
      legend.position  = "none",
      plot.title       = element_text(hjust=0.5,
                                      size=12,
                                      face="bold",
                                      margin=margin(b=5)),
      plot.margin      = margin(t=5, r=5, b=5, l=5),

      #nlarge & rotate x‐labels so they don't collide
      axis.text.x      = element_text(
                           size = 10,
                           angle = 45,
                           hjust = 1,
                           vjust = 1,
                           margin = margin(t = 5)
                         ),
      #give y‐labels a bit more breathing room
      axis.text.y      = element_text(
                           size = 10,
                           margin = margin(r = 5)
                         )
    )
}

# Generate plots for all models and both augmentation levels
p1 <- plot_cm(get_avg_cm(metrics, "XGBoost (PCA)", 0, 0), "XGBoost – No Augmentation")
p2 <- plot_cm(get_avg_cm(metrics, "XGBoost (PCA)", 19, 30), "XGBoost – Max Augmentation")

p3 <- plot_cm(get_avg_cm(metrics, "ResNet", 0, 0), "ResNet – No Augmentation")
p4 <- plot_cm(get_avg_cm(metrics, "ResNet", 19, 30), "ResNet – Max Augmentation")

combined1 <- plot_grid(
  p1, p3, p2, p4,
  nrow = 2,
  align = "hv",
  axis  = "tblr"
)

# axis labels
labeled1 <- add_sub(combined1, "Predicted Class", vpadding = grid::unit(1, "lines"))
labeled1 <- ggdraw(labeled1) +
  draw_label("True Class", angle = 90, x = 0, y = 0.5, vjust = 1.5)

labeled1
Figure 4: True and Predicted label predictions averaged across test sets.

Looking to Figure 4, following image augmentation, ResNet50 predominantly predicted immune and stromal classes, while the CNN defaulted almost exclusively to stromal. Machine learning models initially achieved high precision for the immune class (~80%), which declined to ~60% under full augmentation, and tended to predict stromal and tumour. The ‘other’ class was rarely predicted by any model, likely due to its definition based on function rather than consistent visual features.

5 Application Development

Figure 5: Shiny application pages 1 to 3, larger pictures can be found in the appendix

The application in Figure 5 is designed to complement the real-world workflows of medical imaging professionals and diagnosticians, where visual assessment is central. By pairing augmented images with model performance metrics, it bridges the gap between human judgement and machine classification, supporting interdisciplinary decision-making.

The backend integrates two key pipelines: a pre-computed (but dynamically updatable) metrics pipeline, and pre-trained models that allow new predictions from user-uploaded images.

Page 1 displays example cell images from each class under user-selected blur and noise levels, alongside graphical performance summaries to allow quick model comparison under the augmentations applied.

Page 2 provides per-model class-level performance, including confidence and confusion matrices, supporting evaluation of model reliability and cell-type-specific performance as required.

Page 3 allows users to upload an image and view predictions across all augmentation combinations, helping assess model behaviour under novel, real-world image conditions.ns.

6 Discussion and Limitations

Blur Simulation: Blur was uniformly applied across images, unlike real-world cases where focus artefacts vary spatially. As image degradation in practice is more complex than simple augmentations, our simulation may not fully reflect real-world conditions. Future work could implement spatially localised or random blur for more realistic degradation.

Data Size: Due to computational limits, only a subset of data was used. Scaling to the full dataset may enhance accuracy and robustness, especially for deep learning models.

Sampling Strategy: Although class labels were balanced, cell subtypes were not, potentially introducing bias. More granular sampling could improve representation.

Domain Generalisability: The study focused on breast tissue. Results may not generalise to other tissues or modalities with different degradation patterns. Broader testing is needed to assess robustness.

PCA and Deep Learning Integration: PCA boosted robustness in machine learning models. Incorporating PCA-reconstructed inputs into deep learning may balance CNN accuracy with global feature stability.

Model Development: Augmentations were test-time only. Training with them could improve resilience. Techniques like denoising or deblurring might also help counter degradation effects.

7 Conclusion

Accurate identification of cell types in histological breast tissue is essential for effective, early cancer diagnosis and treatment planning. However, real-world slides often suffer from quality issues, such as the blur and noise introduced by WSI, a critical consideration when needing to detect the presence of critical classes like Tumour cells.

This study evaluated four classification models across varying levels of image degradation, simulated with blur and noise. They were tested on a four-class problem, distinguishing breast tissue cells based on their function in cancer. It was found that image quality had a considerable effect on both machine and deep learning model performance. While our ResNet50 had the best performance on unaugmented images, PCA based machine learning models achieved stable and predictable performances even under severe image quality degradation. This suggests the potential for use of simpler machine learning models in instances where histological images have low quality, but the importance of correct cell type identification remains high.

This exploration underscores the importance of aligning the choice of model with the expected image quality in real-world workflows of pathologists and histology-based diagnosticians. To support this, an interactive Shiny application was developed to visualise how models react to different levels of image quality, so that medical professionals can make an informed choice of model based on their image quality and diagnostic priorities.

8 Student Contribution

Elise wrote the shiny app script, deployed a reproducibility pipeline, trained and evaluated RF, researched cell groupings. Emily trained XGBoost, wrote the evaluation file for Resnet, CNN, XGBoost, wrote helper files/functions for shiny, made the presentation slides and schematic. Elise and Emily collated previous research and wrote report and speech drafts, then edited and formatted into QMD. Jason wrote the processing file, trained ResNet, wrote the initial methodology and results drafts and assisted with some observations. Alan trained the CNN, made initial presentation slides and did research on methodology. Barry did research on background and a few other methodology areas. Ye trained initial KNN and CNN models

9 Appendix

9.1 Shiny App

9.1.1 Page 1

9.2 Page 2

9.3 Page 3

References

1.
2.
3.
Luo, S. et al. Motion blur detection in radiographs. SPIE 6914, 100440 (2008).
4.
5.
Shakhawat, N. et al. Automatic quality evaluation of whole slide images for the practical use of whole slide imaging scanner. ITE Transactions on Media Technology and Applications 8, 252–268 (2020).
6.
7.
Ghazanfar, S. Preprocessing code. (2025).
8.
9.
Fridman, P. et el. The immune contexture in human tumours: Impact on clinical outcome. Nature Reviews Cancer 12, (2012).
10.
Kalluri, R. The biology and function of fibroblasts in cancer. Nature Reviews Cancer 16, (2016).
11.
Duchon, C. E. Lanczos filtering in one and two dimensions. Journal of Applied Meteorology and Climatology 18, 1016–1022 (1979).
12.
13.
Xu, et al., Fu. ResNet and its application to medical image processing: Research progress and challenges. Comput Methods Programs Biomed (2023) doi:https://doi.org/10.1016/j.cmpb.2023.107660.
14.
15.
Shivajirao Jadhav, S. Y. &. Deep convolutional neural network based medical image classification for disease diagnosis. Journal of Big Data 6, (2019).
16.
Mudrova, M. & Prochazka, A. PRINCIPAL COMPONENT ANALYSIS IN IMAGE PROCESSING.
17.
18.
19.
Deng, J. et al. Imagenet: A large-scale hierarchical image database. in 2009 IEEE conference on computer vision and pattern recognition 248–255 (Ieee, 2009).
20.
21.
Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining 785–794 (ACM, 2016). doi:10.1145/2939672.2939785.
22.
Brieman, L. Random forests. Machine Learning 45, 5–32 (2001).
23.
O’Shea, K. & Nash, R. An introduction to convolutional neural networks. (2015).
24.
Bakir, et al., Weston. Learning to find pre-images. in Advances in neural information processing systems 16 449–456 (Biologische Kybernetik; Max-Planck-Gesellschaft; MIT Press, 2004).
25.